Machine learning comes in two main varieties: supervised and unsupervised. In supervised learning, we have labelled training data:
\[(X_1, y_1), (X_2, y_2), ..., (X_K, y_K)\]
Define a cost of mismatch between \(y_{i}^{\text{pred}}=f(X_i|\theta)\) and \(y_{i}^\text{actual}\):
\[\begin{equation} C(f(X_i|\theta), \;y_{i}^\text{actual}) \geq 0. \end{equation}\]
For the cat / dog example, we could use a 0–1 cost:

- \(C(1, 0) = C(0, 1) = 1\)
- \(C(0, 0) = C(1, 1) = 0\)
Then choose \(\theta\) to minimise costs.
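The 0–1 cost above can be sketched as a short function (the name `zero_one_cost` and the 0/1 label encoding are our illustrative choices):

```python
# Sketch of the 0-1 cost: label 0 = cat, 1 = dog (encoding is illustrative).
def zero_one_cost(y_pred, y_actual):
    """Return 1 for a mismatch, 0 for a correct prediction."""
    return 0 if y_pred == y_actual else 1

# Total cost over a small batch of (prediction, actual) pairs.
pairs = [(1, 1), (0, 1), (1, 0), (0, 0)]
total = sum(zero_one_cost(p, a) for p, a in pairs)
```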
A simple example is linear regression:

\[\begin{equation} y_i = \alpha + \beta x_i + \epsilon_i \end{equation}\]
where \(\epsilon_i\) is an error term. Define mean-squared loss:
\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} (y_i - (\alpha + \beta x_i))^2 \end{equation}\]
determine \(\hat{\alpha}\) and \(\hat{\beta}\) as those minimising \(L\):
\[\begin{align} \frac{\partial L}{\partial \alpha} &= -\frac{2}{K}\sum_{i=1}^{K} (y_i - (\alpha + \beta x_i)) = 0\\ \frac{\partial L}{\partial \beta} &= -\frac{2}{K}\sum_{i=1}^{K} x_i (y_i - (\alpha + \beta x_i)) = 0 \end{align}\]
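Solving these two equations gives the familiar closed-form estimates \(\hat{\beta} = \text{cov}(x, y)/\text{var}(x)\) and \(\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}\). A minimal sketch with simulated data (the true values \(\alpha = 2\), \(\beta = 3\) and the noise scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)  # true alpha=2, beta=3

# Closed-form least-squares estimates from the normal equations:
#   beta_hat  = cov(x, y) / var(x)
#   alpha_hat = mean(y) - beta_hat * mean(x)
beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
alpha_hat = y.mean() - beta_hat * x.mean()
```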
Although a closed-form expression exists for \(\hat{\alpha}\) and \(\hat{\beta}\), for more general models one does not \(\implies\) use gradient descent optimisation:
\[\begin{align} \alpha &\leftarrow \alpha - \eta \frac{\partial L}{\partial \alpha}\\ \beta &\leftarrow \beta - \eta \frac{\partial L}{\partial \beta} \end{align}\]
iterating until \(\alpha\) and \(\beta\) no longer change, where \(\eta > 0\) is the learning rate.
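A minimal gradient-descent sketch for this model, using the two partial derivatives above (the simulated data, learning rate, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=200)  # true alpha=1, beta=2

alpha, beta = 0.0, 0.0  # initial guesses
eta = 0.1               # learning rate
for _ in range(500):
    resid = y - (alpha + beta * x)
    grad_alpha = -2.0 * resid.mean()        # dL/d(alpha)
    grad_beta = -2.0 * (x * resid).mean()   # dL/d(beta)
    alpha -= eta * grad_alpha
    beta -= eta * grad_beta
```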
Extending to polynomial regression of degree \(p\):

\[\begin{equation} y_i = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + ... + \theta_p x_i^p + \epsilon_i \end{equation}\]
the model is better able to fit more complex datasets, at the risk of overfitting. To control this, we can add a regularisation penalty to the loss:
\[\begin{equation} L = C||\theta||_q + \frac{1}{K} \sum_{i=1}^{K} (y_i - f_p(x_i))^2 \end{equation}\]
where \(\|\cdot\|_q\) denotes the \(L_q\) norm; different choices of \(q\) can yield very different estimates.
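For the squared \(L_2\) norm (\(q = 2\) with the norm squared), the penalised linear model has a closed form: minimising \(C\|\theta\|_2^2 + \frac{1}{K}\sum_i (y_i - X_i\theta)^2\) gives \(\hat\theta = (X^\top X + CK\,I)^{-1} X^\top y\). A sketch with simulated data (the penalty strength and true coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 100                                  # number of data points
X = rng.normal(size=(K, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=K)

# Ridge closed form for the loss C*||theta||^2 + (1/K) * sum of squared errors:
#   theta = (X'X + C*K*I)^{-1} X'y
C = 0.01  # penalty strength (illustrative)
theta_ridge = np.linalg.solve(X.T @ X + C * K * np.eye(3), X.T @ y)
```

Larger `C` shrinks the estimates further towards zero.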
for new data point \(\tilde x_i\):
Many options are possible. Common similarity metrics include cosine similarity:
\[\begin{equation} s(x_1,x_2) = \frac{x_1 \cdot x_2}{\|x_1\|\,\|x_2\|} \end{equation}\]
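A minimal implementation of this similarity (the function name is our illustrative choice):

```python
import numpy as np

def cosine_similarity(x1, x2):
    """s(x1, x2) = (x1 . x2) / (|x1| |x2|), as defined above."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
```

Orthogonal vectors score 0; parallel vectors score 1 regardless of magnitude.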
Assume the data are drawn from a standard multivariate normal:
\[\begin{equation} \boldsymbol{x} \sim \mathcal{N}(0, I) \end{equation}\]
where \(I \in \mathbb{R}^{d \times d}\) is the identity matrix. What does the distribution of Euclidean distances between points look like as \(d\) changes?
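A quick simulation sketch of this question (sample sizes and dimensions are illustrative): as \(d\) grows, the mean pairwise distance grows like \(\sqrt{2d}\) while the spread stays roughly constant, so distances concentrate in relative terms.

```python
import numpy as np

rng = np.random.default_rng(4)

def distance_stats(d, n=200):
    """Sample n points from N(0, I_d); return mean and std of pairwise distances."""
    X = rng.normal(size=(n, d))
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    vals = dists[np.triu_indices(n, k=1)]  # distinct pairs only
    return vals.mean(), vals.std()

# Mean distance grows with d; relative spread (std/mean) shrinks.
stats = {d: distance_stats(d) for d in (2, 10, 100)}
```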
\[\begin{equation} y_i \sim \text{Bernoulli}(\theta_i) \end{equation}\]
where \(\theta_i = \Pr(y_i = 1)\) satisfies \(0 \leq \theta_i \leq 1\). The probability mass function is given by:
\[\begin{equation} \text{Pr}(y_i|\theta_i) = \theta_i^{y_i} (1 - \theta_i)^{1 - y_i} \end{equation}\]
so that \(\text{Pr}(y_i=1) = \theta_i\) and \(\text{Pr}(y_i=0) = 1 - \theta_i\)
In logistic regression, we use the logistic function to map a linear predictor to \((0, 1)\):
\[\begin{equation} \theta_i = f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_i))} \end{equation}\]
Assuming the data are i.i.d., the likelihood is:
\[\begin{equation} L=p(\boldsymbol{y}|\beta,\boldsymbol{x}) = \prod_{i=1}^{K} f_\beta(x_i)^{y_i} (1 - f_\beta(x_i))^{1 - y_i}. \end{equation}\]
We can use gradient descent on the negative log-likelihood to find maximum likelihood estimates (or estimate \(\beta\) using Bayesian inference).
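A gradient-descent sketch on the mean negative log-likelihood, using the convenient identity that its gradient involves \((f_\beta(x_i) - y_i)\) (the simulated data, learning rate, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=500)
true_b0, true_b1 = -0.5, 2.0
p = 1.0 / (1.0 + np.exp(-(true_b0 + true_b1 * x)))
y = rng.binomial(1, p)  # Bernoulli outcomes

b0, b1 = 0.0, 0.0  # initial guesses
eta = 0.5          # learning rate
for _ in range(2000):
    f = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))  # logistic function
    # Gradient of the mean negative log-likelihood:
    b0 -= eta * (f - y).mean()
    b1 -= eta * ((f - y) * x).mean()
```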
It is straightforward to extend the model to incorporate multiple regressors:
\[\begin{equation} f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i}))} \end{equation}\]
But how to interpret parameters of logistic regression?
Another way of writing the logistic function:
\[\begin{align} f_\beta(x_i) &= \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i}))}\\ &= \frac{\exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i})}{1 + \exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i})} \end{align}\]
so that
\[\begin{align} 1 - f_\beta(x_i) = \frac{1}{1 + \exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i})} \end{align}\]
taking the ratio:
\[\begin{equation} \text{odds} = \frac{f_\beta(x_i)}{1-f_\beta(x_i)} = \exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i}) \end{equation}\]
so that
\[\begin{equation} \log\text{odds} =\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i} \end{equation}\]
meaning (say) \(\beta_1\) represents the change in log-odds for a one-unit change in \(x_{1}\), holding the other regressors fixed.
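A quick numerical check of this interpretation with a single regressor (the coefficient values are illustrative): a one-unit increase in \(x\) adds exactly \(\beta_1\) to the log-odds, i.e. multiplies the odds by \(\exp(\beta_1)\).

```python
import math

b0, b1 = -1.0, 0.7  # illustrative coefficients

def log_odds(x):
    return b0 + b1 * x

# Change in log-odds for a one-unit change in x:
delta = log_odds(3.0) - log_odds(2.0)
# Equivalently, the odds are multiplied by exp(b1):
odds_ratio = math.exp(log_odds(3.0)) / math.exp(log_odds(2.0))
```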
All available on SOLO:
Coursera: